The IMDb Scraper is fully containerized using Docker Compose, enabling one-command deployment with all dependencies included.
Architecture Overview
The Docker environment consists of four main services:
- postgres: PostgreSQL 15 database for structured data storage
- tor: TOR proxy for IP rotation and anonymity
- vpn: Gluetun VPN container for geolocation changes
- scraper: The main application container
Docker Compose Configuration
The complete orchestration is defined in docker-compose.yml:
version: "3.9"
services:
  postgres:
    image: postgres:15
    container_name: imdb_postgres
    restart: always
    environment:
      POSTGRES_DB: ${POSTGRES_DB}
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    ports:
      - "${POSTGRES_PORT}:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
      - ./sql:/docker-entrypoint-initdb.d
    networks:
      - app_net
  tor:
    image: dperson/torproxy
    container_name: tor_proxy
    restart: always
    ports:
      - "9050:9050" # SOCKS port for traffic
      - "9051:9051" # Control port for commands (IP rotation)
    command: >
      sh -c "tor --SocksPort 0.0.0.0:9050 --ControlPort 0.0.0.0:9051 --HashedControlPassword '' --CookieAuthentication 0"
    networks:
      - app_net
  vpn:
    image: qmcgaw/gluetun
    container_name: vpn
    cap_add:
      - NET_ADMIN
    environment:
      - VPN_SERVICE_PROVIDER=protonvpn
      - OPENVPN_USER=${VPN_USERNAME}
      - OPENVPN_PASSWORD=${VPN_PASSWORD}
      - SERVER_COUNTRIES=Argentina
    ports:
      - "8888:8888"
    networks:
      - vpn_net
  scraper:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: imdb_scraper
    depends_on:
      - postgres
      - tor
      - vpn
    networks:
      - app_net
      - vpn_net
    env_file:
      - .env
    volumes:
      - .:/app
    command: >
      sh -c "
      echo 'Waiting for Postgres to be ready...' &&
      while ! nc -z postgres 5432; do sleep 1; done &&
      echo 'Postgres ready. Starting scraper...' &&
      echo 'Waiting for Tor to be ready...' &&
      while ! nc -z tor 9050; do sleep 1; done &&
      echo 'SOCKS port ready. Tor fully initialized.' &&
      python presentation/cli/run_scraper.py &&
      echo 'Scraper finished. Running queries.sql...' &&
      PGPASSWORD=$POSTGRES_PASSWORD psql -h postgres -U $POSTGRES_USER -d $POSTGRES_DB -f sql/queries.sql &&
      echo 'SQL queries executed. Keeping container active...' &&
      tail -f /dev/null
      "
volumes:
  pgdata:
networks:
  app_net:
    driver: bridge
  vpn_net:
    driver: bridge
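The file above relies on variable substitution, so every referenced variable has to be defined before `docker-compose up` will behave as intended. As a rough aid, the sketch below lists the variable names a Compose file references. The function name `referenced_variables` is hypothetical, and the regular expression only approximates Compose's interpolation rules (it handles `$VAR`, `${VAR}`, `${VAR:-default}`, and skips `$$` escapes).

```python
import re

# Matches $NAME or ${NAME...}; the lookbehind skips the "$$" escape sequence.
_VAR = re.compile(
    r"(?<!\$)\$(?:\{([A-Za-z_][A-Za-z0-9_]*)[^}]*\}|([A-Za-z_][A-Za-z0-9_]*))"
)


def referenced_variables(compose_text: str) -> list:
    """Return the sorted, de-duplicated variable names referenced in the text."""
    names = {braced or bare for braced, bare in _VAR.findall(compose_text)}
    return sorted(names)
```

Running it over the contents of docker-compose.yml, for example `referenced_variables(open("docker-compose.yml").read())`, gives a checklist to compare against the .env file.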
Dockerfile Breakdown
The scraper container is built from this Dockerfile:
FROM python:3.11-slim
ENV DEBIAN_FRONTEND=noninteractive
WORKDIR /app
# Install system dependencies including postgresql-client
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc \
    libpq-dev \
    tor \
    curl \
    netcat-openbsd \
    gnupg \
    ca-certificates \
    postgresql-client \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*
# Copy and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the rest of the project
COPY . .
# Default command
CMD ["python", "presentation/cli/run_scraper.py"]
Key Components
Base image: python:3.11-slim, a lightweight Python environment with a minimal footprint.
System dependencies:
- gcc: Required for compiling Python packages with C extensions
- libpq-dev: PostgreSQL development libraries for psycopg2
- tor: TOR network client
- netcat-openbsd: Provides nc, used for the startup port checks
- postgresql-client: Provides psql, used to run sql/queries.sql
Python dependencies: Installed from requirements.txt with --no-cache-dir to reduce image size.
Service Dependencies
The scraper container has explicit dependencies:
depends_on:
- postgres
- tor
- vpn
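depends_on only controls start order; it does not wait for a service to accept connections. The startup command therefore adds its own readiness checks:
- Waits for PostgreSQL on port 5432
- Waits for TOR SOCKS proxy on port 9050
- Starts scraping only after both ports accept connections (the vpn service is not checked)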
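The same wait logic can be expressed in Python, which may be useful if the check ever moves out of the shell command and into the application. This is a minimal sketch, not code from the project; the function name `wait_for_port` and its `timeout` and `interval` parameters are assumptions chosen for illustration.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Same idea as `nc -z host port`: open a connection, then close it.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Covers refused connections, DNS failures, and connect timeouts.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Inside the scraper container, `wait_for_port("postgres", 5432)` and `wait_for_port("tor", 9050)` would mirror the two nc loops. Unlike those loops, this version gives up after `timeout` seconds instead of waiting forever.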
Volume Mounts
Database Persistence
pgdata:/var/lib/postgresql/data ensures PostgreSQL data survives container restarts
SQL Initialization
./sql:/docker-entrypoint-initdb.d runs the SQL scripts in ./sql when the database is first initialized (that is, when the pgdata volume is still empty)
Application Code
.:/app mounts the entire project for development (can be removed in production)
Build and Run
Initial Build
docker-compose build --no-cache
This builds the scraper image from scratch, ensuring all dependencies are fresh.
Start All Services
docker-compose up
Or run in detached mode:
docker-compose up -d
View Logs
# All services
docker-compose logs -f
# Specific service
docker-compose logs -f scraper
Stop Services
docker-compose down
Rebuild After Changes
docker-compose down
docker-compose build --no-cache
docker-compose up
Port Mappings
| Service | Internal Port | External Port | Purpose |
|---|---|---|---|
| postgres | 5432 | ${POSTGRES_PORT} | Database connections |
| tor | 9050 | 9050 | SOCKS proxy traffic |
| tor | 9051 | 9051 | Control port (IP rotation) |
| vpn | 8888 | 8888 | VPN HTTP proxy |
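To illustrate what the control port on 9051 is for, here is a hedged sketch of requesting a new Tor identity with the control protocol's `AUTHENTICATE` and `SIGNAL NEWNYM` commands. It is not necessarily how the scraper performs rotation. The function name `request_new_tor_identity` is hypothetical, and the empty default password is an assumption based on the `--HashedControlPassword ''` and `--CookieAuthentication 0` flags in docker-compose.yml.

```python
import socket


def request_new_tor_identity(host: str = "tor", port: int = 9051, password: str = "") -> None:
    """Send AUTHENTICATE and SIGNAL NEWNYM; raise RuntimeError on a non-250 reply."""
    with socket.create_connection((host, port), timeout=10) as conn:
        with conn.makefile("rb") as reader:

            def command(line: str) -> None:
                # Control-protocol lines are terminated by CRLF.
                conn.sendall((line + "\r\n").encode("ascii"))
                reply = reader.readline().decode("ascii", "replace").strip()
                if not reply.startswith("250"):
                    # Report only the command verb so the password is never echoed.
                    raise RuntimeError(f"Tor rejected {line.split()[0]}: {reply}")

            # Assumption: the password contains no double quotes or backslashes.
            command(f'AUTHENTICATE "{password}"')
            command("SIGNAL NEWNYM")
```

The default host `tor` resolves only inside the Compose network; from the host machine the equivalent call would be `request_new_tor_identity("localhost", 9051)`.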
Accessing Services
PostgreSQL Database
Connect from the host machine:
psql -h localhost -p 5432 -U aruiz -d imdb_scraper
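The host and port differ depending on where the client runs: containers on app_net reach the database at postgres:5432, while the host machine uses localhost and whatever POSTGRES_PORT is set to. The sketch below builds a connection URL for either case. The helper name `build_postgres_url` and its `inside_compose` flag are illustrative assumptions, not part of the project.

```python
from urllib.parse import quote


def build_postgres_url(env: dict, inside_compose: bool) -> str:
    """Build a postgresql:// URL from the POSTGRES_* variables used in .env."""
    host = "postgres" if inside_compose else "localhost"
    # Inside the network the container port is always 5432; the host sees the mapped port.
    port = "5432" if inside_compose else env.get("POSTGRES_PORT", "5432")
    # Percent-encode credentials so characters like "@" or "/" do not break the URL.
    user = quote(env["POSTGRES_USER"], safe="")
    password = quote(env["POSTGRES_PASSWORD"], safe="")
    database = quote(env["POSTGRES_DB"], safe="")
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"
```

Passing `os.environ` as `env` works when the variables are exported. The resulting URL is in the form accepted by psql and by common PostgreSQL drivers.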
Scraper Logs
Real-time logs are available in logs/scraper.log or via Docker:
docker logs -f imdb_scraper
Generated Data
CSV files appear in data/ directory:
movies.csv
actors.csv
movie_actor.csv
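A quick way to confirm that a run produced output is to count the rows in each file. The following sketch makes no assumption about column names. It does assume that each file is UTF-8 encoded and starts with a header row, which this page does not state. The function name `summarize_csv_outputs` is hypothetical.

```python
import csv
from pathlib import Path

CSV_FILES = ("movies.csv", "actors.csv", "movie_actor.csv")


def summarize_csv_outputs(data_dir: str = "data") -> dict:
    """Map each expected file to (header, data_row_count), or None if it is missing."""
    summary = {}
    for name in CSV_FILES:
        path = Path(data_dir) / name
        if not path.is_file():
            summary[name] = None
            continue
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])  # assumed header row; [] for an empty file
            summary[name] = (header, sum(1 for _ in reader))
    return summary
```

A `None` entry for any file after the scraper has finished suggests the run did not get as far as writing that output.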
Troubleshooting
If the scraper exits immediately, check that:
- The .env file exists with all required variables
- PostgreSQL container is running: docker ps
- TOR proxy is responding: docker logs tor_proxy
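The first check can be automated. The sketch below reports which variables are missing or empty in a .env file. The default list covers only the names that docker-compose.yml references, so the application may need others. The parser handles plain KEY=VALUE lines only, and the function name `missing_env_vars` is an assumption for illustration.

```python
from pathlib import Path

# Names referenced via ${...} or $... in docker-compose.yml.
REQUIRED_VARS = (
    "POSTGRES_DB", "POSTGRES_USER", "POSTGRES_PASSWORD",
    "POSTGRES_PORT", "VPN_USERNAME", "VPN_PASSWORD",
)


def missing_env_vars(env_path: str = ".env", required=REQUIRED_VARS) -> list:
    """Return the required names that are absent or empty in a KEY=VALUE file."""
    defined = {}
    path = Path(env_path)
    if path.is_file():
        for raw in path.read_text(encoding="utf-8").splitlines():
            line = raw.strip()
            # Skip blank lines, comments, and anything that is not an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            defined[key.strip()] = value.strip().strip("'\"")
    return [name for name in required if not defined.get(name)]
```

An empty list means every listed variable has a value. If the file does not exist, all names are returned.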
Common Issues
Database Connection Failed
# Check if postgres is running
docker ps | grep postgres
# View postgres logs
docker logs imdb_postgres
TOR Not Ready
# Verify TOR is listening
docker exec tor_proxy netstat -tuln | grep 9050
VPN Connection Issues
# Check VPN status
docker logs vpn
Production Considerations
Remove Volume Mount
Replace the - .:/app bind mount so that only the necessary files are mounted, not the entire codebase
Use Secrets
Replace .env file with Docker secrets or external secret management
Health Checks
Add explicit healthcheck directives to docker-compose.yml
Resource Limits
Set memory and CPU limits for each service
Next Steps
Environment Variables
Configure database, proxy, and VPN credentials
Network Configuration
Understand Docker networks and proxy setup